Thai Paragraph Shortening Based on Binary Classification Model

Authors

  • Kitsuchart Pasupa
  • Ponrudee Netisopakul
Abstract

Thai sentences can often be simplified or shortened by cutting out some words without changing their meaning. In this paper, linear and non-linear (kernel) Fisher discriminant analysis are applied to shorten Thai paragraphs in a corpus. The features are the unique word ID and part of speech of the target word and of its three preceding and three following words, along with its role as a content or function word. Two scenarios are investigated: a global model and a document-specific model. The results show that both Fisher discriminant analysis and kernel Fisher discriminant analysis significantly improve classification accuracy over the baseline in both scenarios. The part of speech of the target word is the most relevant feature, followed by the parts of speech of the adjacent words. Moreover, the document-specific model achieves higher accuracy than the global model, which suggests that an author's writing style plays an important role in the paragraph shortening task.
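The feature layout described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `window_features`, the padding tag `<PAD>`, the padding ID `-1`, the English placeholder tokens, and the set of part-of-speech tags counted as content words are all assumptions made for the example. It also assumes that both the word ID and the part of speech are taken for each neighbor, and that the content/function flag applies to the target word only. The classifier itself (Fisher discriminant analysis) is not shown.

```python
from typing import Dict, List, Tuple

PAD_POS = "<PAD>"  # hypothetical tag for positions outside the paragraph
PAD_ID = -1        # hypothetical word ID for positions outside the paragraph
WINDOW = 3         # three previous and three next words, as in the abstract

# Assumption: which POS tags count as content words (not from the paper).
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def window_features(
    tokens: List[Tuple[str, str]], vocab: Dict[str, int]
) -> List[Dict[str, object]]:
    """Build one feature dict per (word, pos) token in a paragraph."""
    # Assign a unique integer ID to every distinct word.
    ids = [vocab.setdefault(word, len(vocab)) for word, _ in tokens]
    rows = []
    for i, (_, pos) in enumerate(tokens):
        row: Dict[str, object] = {
            "id_0": ids[i],
            "pos_0": pos,
            "is_content": pos in CONTENT_POS,
        }
        # Word ID and POS of the three previous and three next words.
        for k in range(1, WINDOW + 1):
            for sign, j in (("-", i - k), ("+", i + k)):
                inside = 0 <= j < len(tokens)
                row[f"id_{sign}{k}"] = ids[j] if inside else PAD_ID
                row[f"pos_{sign}{k}"] = tokens[j][1] if inside else PAD_POS
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Placeholder tokens; a real run would use segmented, POS-tagged Thai text.
    sample = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")]
    for row in window_features(sample, {}):
        print(row)
```

Each row would then be paired with a binary label (keep or remove the target word) and encoded numerically before being passed to a classifier.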


Similar resources

A new classification method based on pairwise SVM for facial age estimation

This paper presents a practical algorithm for facial age estimation from a frontal face image. Facial age estimation generally comprises two key steps: age image representation and age estimation. The anthropometric model used in this study includes computation of eighteen craniofacial ratios and a new accurate skin wrinkles analysis in the first step and a pairwise binary support vector...


Binary Paragraph Vectors

Recently Le & Mikolov described two log-linear models, called Paragraph Vector, that can be used to learn state-of-the-art distributed representations of documents. Inspired by this work, we present Binary Paragraph Vector models: simple neural networks that learn short binary codes for fast information retrieval. We show that binary paragraph vectors outperform autoencoder-based binary codes, d...


A High-Performance Model based on Ensembles for Twitter Sentiment Classification

Background and Objectives: Twitter sentiment classification is one of the most popular fields in information retrieval and text mining. Millions of people around the world use social networks like Twitter intensively. It lets users publish tweets to say what they think about topics. There are numerous web sites on the Internet presenting Twitter. The user can enter a sentiment ta...


A Hidden Conditional Random Field-Based Approach for Thai Tone Classification

In Thai, tonal information is a crucial component for identifying the lexical meaning of a word. Consequently, Thai tone classification can improve the performance of Thai speech recognition systems. In this article, we therefore report our study of Thai tone classification. Based on our investigation, most Thai tone classification studies relied on statistical machine learning approa...


Thai News Text Summarization and Its Application

Since the Thai language lacks word/phrase/sentence boundaries, document summarization in Thai requires investigation of unit segmentation, unit selection, redundancy removal and evaluation dataset construction. In this work, we propose the Thai Elementary Discourse Unit (TEDU) and a three-stage method of Thai multidocument summarization, i.e., unit segmentation, unit-graph formulation, and unit sel...



Publication date: 2012